Honggang Zhang

Reinforcement Learning-Based Heterogeneous Multi-Task Optimization in Semantic Broadcast Communications

Apr 28, 2025

Zhilin Lu, Rongpeng Li, Zhifeng Zhao, Honggang Zhang

Abstract:Semantic broadcast communications (Semantic BC) for image transmission have achieved significant performance gains for single-task scenarios. Nevertheless, extending these methods to multi-task scenarios remains challenging, as different tasks typically require distinct objective functions, leading to potential conflicts within the shared encoder. In this paper, we propose a tri-level reinforcement learning (RL)-based multi-task Semantic BC framework, termed SemanticBC-TriRL, which effectively resolves such conflicts and enables the simultaneous support of multiple downstream tasks at the receiver side, including image classification and content reconstruction tasks. Specifically, the proposed framework employs a bottom-up tri-level alternating learning strategy, formulated as a constrained multi-objective optimization problem. At the first level, task-specific decoders are locally optimized using supervised learning. At the second level, the shared encoder is updated via proximal policy optimization (PPO), guided by task-oriented rewards. At the third level, a multi-gradient aggregation-based task weighting module adaptively adjusts task priorities and steers the encoder optimization. Through this hierarchical learning process, the encoder and decoders are alternately trained, and the three levels are cohesively integrated via a constrained learning objective. The convergence of SemanticBC-TriRL is also theoretically established. Extensive simulation results demonstrate the superior performance of the proposed framework across diverse channel conditions, particularly in low-SNR regimes, and confirm its scalability with increasing numbers of receivers.
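
To make the third-level idea of gradient-based task weighting concrete, here is a minimal sketch for two tasks. It is not the paper's method: it assumes a standard min-norm combination of two task gradients (the closed form used in multiple-gradient-descent-style approaches), and the names `min_norm_weights` and `aggregate` are hypothetical.

```python
def min_norm_weights(g1, g2):
    """Return (a, 1 - a) minimizing ||a*g1 + (1 - a)*g2||^2 for a in [0, 1].

    g1 and g2 are task gradients given as equal-length lists of floats.
    This two-task closed form is an illustrative assumption, not the
    weighting rule from the paper.
    """
    diff = [x - y for x, y in zip(g1, g2)]
    denom = sum(d * d for d in diff)
    if denom == 0.0:
        # Identical gradients: any convex weighting gives the same result.
        return 0.5, 0.5
    # Unconstrained minimizer: a = (g2 - g1) . g2 / ||g1 - g2||^2
    a = sum((y - x) * y for x, y in zip(g1, g2)) / denom
    # Project onto [0, 1] so the weights remain a convex combination.
    a = min(1.0, max(0.0, a))
    return a, 1.0 - a


def aggregate(g1, g2):
    """Combine two task gradients into one shared-encoder update direction."""
    a, b = min_norm_weights(g1, g2)
    return [a * x + b * y for x, y in zip(g1, g2)]
```

With orthogonal gradients of equal norm the two tasks are weighted equally; when one gradient is a scaled-up copy of the other, all weight goes to the smaller one.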

Separate Source Channel Coding Is Still What You Need: An LLM-based Rethinking

Jan 08, 2025

Tianqi Ren, Rongpeng Li, Ming-min Zhao, Xianfu Chen, Guangyi Liu, Yang Yang, Zhifeng Zhao, Honggang Zhang

Abstract:Along with the proliferating research interest in Semantic Communication (SemCom), Joint Source Channel Coding (JSCC) has dominated the attention owing to its widely assumed efficiency in delivering information semantics. Nevertheless, this paper challenges the conventional JSCC paradigm and advocates the adoption of Separate Source Channel Coding (SSCC) to exploit the greater degrees of freedom it offers for optimization. We demonstrate that SSCC, by leveraging the strengths of a Large Language Model (LLM) for source coding and an Error Correction Code Transformer (ECCT) for channel decoding, offers superior performance over JSCC. Our proposed framework also highlights the compatibility challenges between SemCom approaches and digital communication systems, particularly concerning the resource costs associated with the transmission of high-precision floating-point numbers. Through comprehensive evaluations, we establish that, empowered by LLM-based compression and ECCT-enhanced error correction, SSCC remains a viable and effective solution for modern communication systems. In other words, separate source and channel coding is still what we need!

Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models

Dec 17, 2024

YiFan Zhang, Shanglin Lei, Runqi Qiao, Zhuoma GongQue, Xiaoshuai Song, Guanting Dong, Qiuna Tan, Zhe Wei, Peiqing Yang, Ye Tian(+3 more)

Abstract:The rapidly developing field of large multimodal models (LMMs) has led to the emergence of diverse models with remarkable capabilities. However, existing benchmarks fail to comprehensively, objectively and accurately evaluate whether LMMs align with the diverse needs of humans in real-world scenarios. To bridge this gap, we propose the Multi-Dimensional Insights (MDI) benchmark, which includes over 500 images covering six common scenarios of human life. Notably, the MDI-Benchmark offers two significant advantages over existing evaluations: (1) Each image is accompanied by two types of questions: simple questions to assess the model's understanding of the image, and complex questions to evaluate the model's ability to analyze and reason beyond basic content. (2) Recognizing that people of different age groups have varying needs and perspectives when faced with the same scenario, our benchmark stratifies questions into three age categories: young people, middle-aged people, and older people. This design allows for a detailed assessment of LMMs' capabilities in meeting the preferences and needs of different age groups. With the MDI-Benchmark, a strong model such as GPT-4o achieves 79% accuracy on age-related tasks, indicating that existing LMMs still have considerable room for improvement in addressing real-world applications. Looking ahead, we anticipate that the MDI-Benchmark will open new pathways for aligning real-world personalization in LMMs. The MDI-Benchmark data and evaluation code are available at https://mdi-benchmark.github.io/
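
As an illustration of how results on a benchmark stratified this way might be tallied, the sketch below computes accuracy per (age group, question level) pair. It is not the official evaluation code (which is at the URL above); the record layout and the function name `stratified_accuracy` are assumptions, and the sample records are made up.

```python
from collections import defaultdict


def stratified_accuracy(records):
    """Compute accuracy for each (age_group, level) stratum.

    `records` is an iterable of (age_group, level, correct) tuples, where
    `correct` is a bool. This layout is a hypothetical one chosen for the
    example, not the benchmark's actual data format.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for age_group, level, correct in records:
        key = (age_group, level)
        totals[key] += 1
        hits[key] += int(correct)
    return {key: hits[key] / totals[key] for key in totals}


# Made-up toy records, for illustration only.
sample = [
    ("young", "simple", True),
    ("young", "simple", False),
    ("older", "complex", True),
]
per_stratum = stratified_accuracy(sample)
```

Reporting per-stratum scores rather than a single average keeps differences between age groups and question difficulty visible.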

* 33 pages, 33 figures, Work in progress

VersaGen: Unleashing Versatile Visual Control for Text-to-Image Synthesis

Dec 17, 2024

Zhipeng Chen, Lan Yang, Yonggang Qi, Honggang Zhang, Kaiyue Pang, Ke Li, Yi-Zhe Song

Abstract:Despite the rapid advancements in text-to-image (T2I) synthesis, enabling precise visual control remains a significant challenge. Existing works attempted to incorporate multi-facet controls (text and sketch), aiming to enhance the creative control over generated images. However, our pilot study reveals that the expressive power of humans far surpasses the capabilities of current methods. Users desire a more versatile approach that can accommodate their diverse creative intents, ranging from controlling individual subjects to manipulating the entire scene composition. We present VersaGen, a generative AI agent that enables versatile visual control in T2I synthesis. VersaGen admits four types of visual controls: i) single visual subject; ii) multiple visual subjects; iii) scene background; iv) any combination of the three above or merely no control at all. We train an adaptor upon a frozen T2I model to accommodate the visual information into the text-dominated diffusion process. We introduce three optimization strategies during the inference phase of VersaGen to improve generation results and enhance user experience. Comprehensive experiments on COCO and Sketchy validate the effectiveness and flexibility of VersaGen, as evidenced by both qualitative and quantitative results.

* The paper has been accepted by AAAI 2025. Paper code: https://github.com/FelixChan9527/VersaGen_official

MERLOT: A Distilled LLM-based Mixture-of-Experts Framework for Scalable Encrypted Traffic Classification

Nov 20, 2024

Yuxuan Chen, Rongpeng Li, Zhifeng Zhao, Honggang Zhang

Abstract:We present MERLOT, a scalable mixture-of-experts (MoE)-based refinement of a distilled large language model optimized for encrypted traffic classification. By applying model distillation techniques in a teacher-student paradigm, compact models derived from GPT-2-base retain high classification accuracy while minimizing computational costs. These models function as specialized experts in an MoE architecture, dynamically assigned via a gating network. Unlike generation-based methods, our approach directly classifies encrypted traffic using the final decoder token with contextual feature embedding as input. Experiments on 10 datasets show superior or competitive performance over the state-of-the-art models while significantly reducing resource demands, underscoring its effectiveness and robustness.
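
The gating step of a mixture-of-experts model can be sketched in a few lines. The example below is a generic illustration, not MERLOT's implementation: it assumes a single linear gating layer followed by softmax and top-1 routing, and the names `gate` and `route_top1` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gate(features, gate_weights):
    """Return one routing probability per expert.

    `gate_weights` holds one weight row per expert; a single linear layer
    is an assumption made for this sketch.
    """
    scores = [sum(w * x for w, x in zip(row, features)) for row in gate_weights]
    return softmax(scores)


def route_top1(features, gate_weights, experts):
    """Send the input to the single highest-probability expert."""
    probs = gate(features, gate_weights)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, experts[best](features)


# Toy usage: two stand-in "experts" that return a fixed label.
experts = [lambda f: "class_a", lambda f: "class_b"]
choice = route_top1([2.0, 0.5], [[1.0, 0.0], [0.0, 1.0]], experts)
```

Top-1 routing runs only one expert per input, which is one common way an MoE keeps inference cost low; whether MERLOT routes to one expert or several is not stated in the abstract.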

Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation

Oct 14, 2024

Peiwen Sun, Sitong Cheng, Xiangtai Li, Zhen Ye, Huadai Liu, Honggang Zhang, Wei Xue, Yike Guo

Abstract:Recently, diffusion models have achieved great success in mono-channel audio generation. However, when it comes to stereo audio generation, the soundscapes often have a complex scene of multiple objects and directions. Controlling stereo audio with spatial contexts remains challenging due to high data costs and unstable generative models. To the best of our knowledge, this work represents the first attempt to address these issues. We first construct a large-scale, simulation-based, and GPT-assisted dataset, BEWO-1M, with abundant soundscapes and descriptions even including moving and multiple sources. Beyond text modality, we have also acquired a set of images and rationally paired stereo audios through retrieval to advance multimodal generation. Existing audio generation models tend to generate rather random and indistinct spatial audio. To provide accurate guidance for latent diffusion models, we introduce the SpatialSonic model utilizing spatial-aware encoders and azimuth state matrices to reveal reasonable spatial guidance. By leveraging spatial guidance, our unified model not only achieves the objective of generating immersive and controllable spatial audio from text and image but also enables interactive audio generation during inference. Finally, under fair settings, we conduct subjective and objective evaluations on simulated and real-world data to compare our approach with prevailing methods. The results demonstrate the effectiveness of our method, highlighting its capability to generate spatial audio that adheres to physical rules.

Unveiling and Mitigating Bias in Audio Visual Segmentation

Jul 23, 2024

Peiwen Sun, Honggang Zhang, Di Hu

Abstract:Community researchers have developed a range of advanced audio-visual segmentation models aimed at improving the quality of sounding objects' masks. While masks created by these models may initially appear plausible, they occasionally exhibit anomalies with incorrect grounding logic. We attribute this to real-world inherent preferences and distributions serving as a simpler signal for learning than the complex audio-visual grounding, which leads to the disregard of important modality information. Generally, the anomalous phenomena are often complex and cannot be directly observed systematically. In this study, we make a pioneering effort with proper synthetic data to categorize and analyze the phenomena as two types, "audio priming bias" and "visual prior", according to the source of the anomalies. For audio priming bias, to enhance audio sensitivity to different intensities and semantics, a perception module specifically for audio perceives the latent semantic information and incorporates it into a limited set of queries, namely active queries. Moreover, the interaction mechanism related to such active queries in the transformer decoder is customized to regulate the interaction among audio semantics. For visual prior, multiple contrastive training strategies are explored to optimize the model by incorporating a biased branch, without changing the structure of the model. In experiments, our observations demonstrate the presence and impact of the biases in the existing model. Finally, through experimental evaluation on AVS benchmarks, we demonstrate the effectiveness of our methods in handling both types of biases, achieving competitive performance across all three subsets.

* Accepted by ACM MM 24 (ORAL)

Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes

Jul 15, 2024

Yaoting Wang, Peiwen Sun, Dongzhan Zhou, Guangyao Li, Honggang Zhang, Di Hu

Abstract:Traditional reference segmentation tasks have predominantly focused on silent visual scenes, neglecting the integral role of multimodal perception and interaction in human experiences. In this work, we introduce a novel task called Reference Audio-Visual Segmentation (Ref-AVS), which seeks to segment objects within the visual domain based on expressions containing multimodal cues. Such expressions are articulated in natural language forms but are enriched with multimodal cues, including audio and visual descriptions. To facilitate this research, we construct the first Ref-AVS benchmark, which provides pixel-level annotations for objects described in corresponding multimodal-cue expressions. To tackle the Ref-AVS task, we propose a new method that adequately utilizes multimodal cues to offer precise segmentation guidance. Finally, we conduct quantitative and qualitative experiments on three test subsets to compare our approach with existing methods from related tasks. The results demonstrate the effectiveness of our method, highlighting its capability to precisely segment objects using multimodal-cue expressions. Dataset is available at https://gewu-lab.github.io/Ref-AVS

* Accepted by ECCV2024

Can Textual Semantics Mitigate Sounding Object Segmentation Preference?

Jul 15, 2024

Yaoting Wang, Peiwen Sun, Yuanchao Li, Honggang Zhang, Di Hu

Abstract:The Audio-Visual Segmentation (AVS) task aims to segment sounding objects in the visual space using audio cues. However, in this work, it is recognized that previous AVS methods show a heavy reliance on detrimental segmentation preferences related to audible objects, rather than precise audio guidance. We argue that the primary reason is that audio lacks robust semantics compared to vision, especially in multi-source sounding scenes, resulting in weak audio guidance over the visual space. Motivated by the fact that the text modality is well explored and contains rich abstract semantics, we propose leveraging text cues from the visual scene to enhance audio guidance with the semantics inherent in text. Our approach begins by obtaining scene descriptions through an off-the-shelf image captioner and prompting a frozen large language model to deduce potential sounding objects as text cues. Subsequently, we introduce a novel semantics-driven audio modeling module with a dynamic mask to integrate audio features with text cues, leading to representative sounding object features. These features not only encompass audio cues but also possess vivid semantics, providing clearer guidance in the visual space. Experimental results on AVS benchmarks validate that our method exhibits enhanced sensitivity to audio when aided by text cues, achieving highly competitive performance on all three subsets. Project page: https://github.com/GeWu-Lab/Sounding-Object-Segmentation-Preference

* Accepted by ECCV2024

Communication and Control Co-Design in 6G: Sequential Decision-Making with LLMs

Jul 06, 2024

Xianfu Chen, Celimuge Wu, Yi Shen, Yusheng Ji, Tsutomu Yoshinaga, Qiang Ni, Charilaos C. Zarakovitis, Honggang Zhang

Abstract:This article investigates a control system within the context of sixth-generation wireless networks. The control performance optimization confronts the technical challenges that arise from the intricate interactions between communication and control sub-systems, calling for a co-design. Accounting for the system dynamics, we formulate the sequential co-design decision-making of communication and control over the discrete time horizon as a Markov decision process, for which a practical offline learning framework is proposed. Our proposed framework integrates large language models into the elements of reinforcement learning. We present a case study on the age of semantics-aware communication and control co-design to showcase the potential of our proposed learning framework. Furthermore, we discuss the open issues remaining to make our proposed offline learning framework feasible for real-world implementations, and highlight the research directions for future explorations.
